Extending LLM usage for PDFs when the extracted text is empty after pdfminer by gjmveloso · Pull Request #1285 · microsoft/markitdown

gjmveloso · 2025-06-06T23:16:46Z

Initial work on using an LLM to perform OCR on a PDF when pdfminer returns empty text.

gjmveloso · 2025-06-06T23:19:17Z

@microsoft-github-policy-service agree

dillonstreator · 2025-06-10T21:50:05Z

packages/markitdown/src/markitdown/converters/_pdf_converter.py

+                    prompt=llm_prompt,
+                )
+
+        return DocumentConverterResult(markdown=str(markdown))


There is an issue with PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.

Are you thinking of something like replacing the usage of extract_text with extract_pages and iterating over its non-text elements, like LTImage and LTFigure?

Layout system reference:
https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#topic-pdf-to-text-layout

Yes - that would allow a much more reliable, predictable, and comprehensive text extraction.
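For illustration, the traversal discussed above could look roughly like the sketch below. It is not the code in this PR. `iter_layout_images` and `find_pdf_images` are hypothetical names, and the walker takes the image class as a parameter so it can be tried without pdfminer.six installed. It assumes that pdfminer layout containers (LTPage, LTFigure) are iterable over their children, which is how `extract_pages` results are normally traversed.

```python
from typing import Any, Iterator, List, Type


def iter_layout_images(element: Any, image_type: Type) -> Iterator[Any]:
    """Depth-first walk of a layout tree, yielding every node of image_type."""
    if isinstance(element, image_type):
        yield element
        return
    # Strings iterate into themselves forever, so never descend into them.
    if isinstance(element, (str, bytes)):
        return
    try:
        children = iter(element)  # containers such as LTPage / LTFigure
    except TypeError:
        return  # leaf node (for example a character or a line segment)
    for child in children:
        yield from iter_layout_images(child, image_type)


def find_pdf_images(pdf_path: str) -> List[Any]:
    """Hypothetical helper: collect every LTImage in a PDF, page by page."""
    # Imported lazily so the walker above has no hard pdfminer dependency.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    images: List[Any] = []
    for page in extract_pages(pdf_path):
        images.extend(iter_layout_images(page, LTImage))
    return images
```

Because images are usually nested inside LTFigure containers rather than sitting directly on the page, the walk has to recurse instead of checking only top-level page elements.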

Finally, done. Thoughts?

- Proper handling of file_stream positioning after an empty result from pdfminer

- Resolve merge conflicts that were baked into the previous commits
- Add llm_caption import and two prompt constants (_PDF_IMAGE_LLM_PROMPT, _PDF_FULL_LLM_PROMPT) to avoid inline prompt strings
- Add _collect_lt_images() and _get_lt_image_data() helpers for extracting JPEG/JPEG2000 image data from pdfminer LTImage objects; use pdfminer's own LITERALS_DCT_DECODE / LITERALS_JPX_DECODE for filter comparison instead of fragile PSLiteral string conversion
- When no form pages are detected, use pdfminer extract_text for prose quality, then do a second pass with extract_pages to find LTFigure elements containing embedded images and caption each one via the LLM
- Add last-resort whole-document LLM fallback for fully non-searchable PDFs where no captionable images were found
- Guard _merge_partial_numbering_lines call against None return from llm_caption
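As a rough sketch of the control flow this commit message describes (not the actual converter code), the cascade could be arranged as below. `convert_pdf_with_fallbacks` and its four callable parameters are hypothetical stand-ins for the pdfminer text pass, the image collection pass, the per-image LLM caption, and the whole-document LLM call. The sketch also shows the stream rewind between passes and the guard against a None caption.

```python
from typing import BinaryIO, Callable, List, Optional


def convert_pdf_with_fallbacks(
    stream: BinaryIO,
    extract_text: Callable[[BinaryIO], str],
    extract_images: Callable[[BinaryIO], List[bytes]],
    caption_image: Callable[[bytes], Optional[str]],
    caption_document: Callable[[BinaryIO], Optional[str]],
) -> str:
    """Hypothetical cascade: text pass, then image captions, then full-document LLM."""
    start = stream.tell()
    text = (extract_text(stream) or "").strip()

    # The first pass leaves the cursor wherever it stopped reading,
    # so rewind before handing the same stream to the next pass.
    stream.seek(start)
    captions: List[str] = []
    for image in extract_images(stream):
        caption = caption_image(image)
        if caption:  # the captioner may return None or an empty string
            captions.append(caption)

    parts = [part for part in [text, *captions] if part]
    if parts:
        return "\n\n".join(parts)

    # Last resort: no text and no captionable images were found.
    stream.seek(start)
    return caption_document(stream) or ""
```

Passing the passes in as callables keeps the branching testable without a PDF or an LLM client, and it leaves room for the caller-side override suggested in the review.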

gjmveloso changed the title ~~Extending LLM usage for PDFs where the extracted text was empty with pdfminer~~ Extending LLM usage for PDFs when the extracted text is empty after pdfminer Jun 6, 2025

gjmveloso mentioned this pull request Jun 6, 2025

Add OCR fallback for scanned/non-searchable PDFs (#1156) #1268

Open

gjmveloso marked this pull request as draft June 9, 2025 19:38

gjmveloso marked this pull request as ready for review June 9, 2025 22:08

dillonstreator reviewed Jun 10, 2025

stefan-rink mentioned this pull request Jul 10, 2025

FEAT: Image recognition in PDF files #1318

Open

gjmveloso added 5 commits April 7, 2026 12:40

Extending LLM usage for PDFs where the extracted text was empty with …

ed0de82

…pdfminer

Improving prompt to avoid translation/generation of new content

64cc04b

Improving prompt to avoid translation/generation of new content

c09d5c7

- Prompt improvements for non-Gemini models

a4c6046

- Proper handling of file_stream positioning after an empty result from pdfminer

gjmveloso force-pushed the feat/pdf_fallback_with_llm branch from c83bacc to 6742995 Compare April 7, 2026 18:43

gjmveloso requested a review from dillonstreator April 7, 2026 18:50

gjmveloso wants to merge 5 commits into microsoft:main from gjmveloso:feat/pdf_fallback_with_llm
